First, we again create 1 complete dataset, which is done by means of imputing the boys dataset in mice (Van Buuren & Groothuis-Oudshoorn, 2011) with default settings, and complete the dataset. Then, based on this one true dataset, we extract the regression coefficients of a linear regression model in which weight is regressed on age and height, that are used to compare the estimates of the synthetic versions of the datasets. The true regression coefficients are equal to \(\beta_0 = -9.216\), \(\beta_{age} = 2.327\), \(\beta_{height} = 0.191\). Then, for a total of 200 iterations, we impute a matrix with the same dimensions as the boys dataset (that is, the sample size \(n = 748\) and the number of predictors \(k = 9\)). This is done once with the default settings, that is, with pmm for the continuous variables, polr for binary variables and polyreg for ordinal variables with more than two categories. Furthermore, bmi is imputed passively, since it consists of a fixed combination of height and weight bmi = (wgt / (hgt/100)^2. Furthermore, the predictor matrix is adjusted so that imputations for bmi do not flow back into predictions of hgt and wgt. Then, for every synthetic dataset, the estimates, and corresponding variance of the estimates are calculated, as well as the \(95\%~CI\) are calculated. Furthermore, an indicator is added, whether or not the confidence interval includes the true parameter estimates. Note that these confidence intervals are still based on imputations for missing data, instead of imputations for synthetic data. This is due to the fact that the calculation for the degrees of freedom for the estimates based imputing synthetic values might yield degrees of freedom that are so small that the corresponding \(95\% ~ CI\) yields boundary values of \(-\infty\) and \(\infty\).
broom::tidy(truemodel, conf.int=TRUE) %>%
mutate(CIW = conf.high - conf.low) %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| term | estimate | std.error | statistic | p.value | conf.low | conf.high | CIW |
|---|---|---|---|---|---|---|---|
| (Intercept) | -9.216 | 2.033 | -4.534 | 0 | -13.207 | -5.226 | 7.982 |
| age | 2.327 | 0.189 | 12.335 | 0 | 1.956 | 2.697 | 0.741 |
| hgt | 0.191 | 0.028 | 6.843 | 0 | 0.136 | 0.246 | 0.110 |
default_fit <- syns_def %>%
map(function(x) x %$% lm(wgt ~ age + hgt) %>% pool %>% summary) %>%
map_dfr(function(x) {
var <- x[,1]
true_est <- coefs
est <- x[,2]
true_se <- sqrt(diag(vcov(truemodel)))
se <- x[,3]
df <- x[,5]
lower <- est + se * qt(.025, df)
upper <- est + se * qt(.975, df)
cov <- lower < coefs & coefs < upper
bind_cols(var = var, true_est = true_est, est = est, true_se = true_se,
se = se, df = df, lower = lower, upper = upper, cov = cov)
})
results_def <- default_fit %>%
group_by(var) %>%
summarise("True Est" = unique(true_est),
"Syn Est" = mean(est),
"Bias" = mean(est - true_est),
"True SE" = unique(true_se),
"Syn SE" = mean(se),
"df" = mean(df),
"Lower" = mean(lower),
"Upper" = mean(upper),
"CIW" = mean(upper - lower),
"Coverage" = mean(cov))cart_fit <- syns_cart %>%
map(function(x) x %$% lm(wgt ~ age + hgt) %>% pool %>% summary) %>%
map_dfr(function(x) {
var <- x[,1]
true_est <- coefs
est <- x[,2]
true_se <- sqrt(diag(vcov(truemodel)))
se <- x[,3]
df <- x[,5]
lower <- est + se * qt(.025, df)
upper <- est + se * qt(.975, df)
cov <- lower < coefs & coefs < upper
bind_cols(var = var, true_est = true_est, est = est, true_se = true_se,
se = se, df = df, lower = lower, upper = upper, cov = cov)
})
results_cart <- cart_fit %>%
group_by(var) %>%
summarise("True Est" = unique(true_est),
"Syn Est" = mean(est),
"Bias" = mean(est - true_est),
"True SE" = unique(true_se),
"Syn SE" = mean(se),
"df" = mean(df),
"Lower" = mean(lower),
"Upper" = mean(upper),
"CIW" = mean(upper - lower),
"Coverage" = mean(cov))bind_rows("Mice default" = results_def,
"Mice with cart" = results_cart, .id = "Imputation method") %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| Imputation method | var | True Est | Syn Est | Bias | True SE | Syn SE | df | Lower | Upper | CIW | Coverage |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mice default | (Intercept) | -9.216 | -2.160 | 7.057 | 2.033 | 2.941 | 28.084 | -8.657 | 4.338 | 12.995 | 0.338 |
| Mice default | age | 2.327 | 2.948 | 0.621 | 0.189 | 0.285 | 18.624 | 2.304 | 3.591 | 1.287 | 0.490 |
| Mice default | hgt | 0.191 | 0.094 | -0.097 | 0.028 | 0.041 | 22.553 | 0.003 | 0.186 | 0.183 | 0.388 |
| Mice with cart | (Intercept) | -9.216 | -7.632 | 1.585 | 2.033 | 2.691 | 43.924 | -13.312 | -1.952 | 11.361 | 1.000 |
| Mice with cart | age | 2.327 | 2.460 | 0.133 | 0.189 | 0.259 | 37.111 | 1.904 | 3.015 | 1.112 | 1.000 |
| Mice with cart | hgt | 0.191 | 0.170 | -0.021 | 0.028 | 0.038 | 37.963 | 0.089 | 0.251 | 0.162 | 1.000 |
For now, we forget about the default mice imputation method, because it leads to flawed results. Therefore, we continue with the CART results only. We fit the lm model in which wgt is regressed on age and hgt.
## pool.lm.syn
## If you have a multiply imputed synthetic dataset, that is, multiple fully imputed
## datasets that are generated by mice, using an all-ones matrix as the where matrix,
## this function can be used to analyse the synthetic data by means of a linear
## regression model, and to pool the results.
pool.syn <- function(mira) {
if(class(mira)[1] == "mira") {
fitlist <- mira %$% analyses
}
else {
fitlist <- mira
}
m <- length(fitlist)
pooled <- fitlist %>%
map_dfr(broom::tidy) %>%
group_by(term) %>%
summarise(est = mean(estimate),
bm = sum((estimate - est)^2) / (m - 1),
ubar = mean(std.error^2),
var_u = (1 + 1/m) * bm - ubar,
var = if_else(var_u > 0, var_u, ubar),
df = max(1, (m - 1) * (1 - ubar / (bm + bm/m))^2),
lower = est - qt(.975, df) * sqrt(var),
upper = est + qt(.975, df) * sqrt(var), .groups = 'drop')
pooled
}
ci_cov <- function(pooled, true_fit) {
coefs <- coef(true_fit)
vars <- diag(vcov(true_fit))
nsim <- nrow(pooled) / length(coefs)
pooled %>% mutate(true_coef = rep(coefs, nsim),
true_var = rep(vars, nsim),
cover = lower < true_coef & true_coef < upper) %>%
group_by(term) %>%
summarise("True Est" = unique(true_coef),
"Syn Est" = mean(est),
"Bias" = mean(est - true_coef),
"True SE" = unique(sqrt(true_var)),
"Syn SE" = mean(sqrt(var)),
"df" = mean(df),
"Lower" = mean(lower),
"Upper" = mean(upper),
"CIW" = mean(upper - lower),
"Coverage" = mean(cover), .groups = "drop")
}Then, we use the custom pooling function to pool the estimates.
And another custom function to summarise all results, and obtain the confidence interval coverage.
ci_cov(pooled, truemodel) %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| term | True Est | Syn Est | Bias | True SE | Syn SE | df | Lower | Upper | CIW | Coverage |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | -9.216 | -7.632 | 1.585 | 2.033 | 1.770 | 32.803 | -20.650 | 5.386 | 26.036 | 0.996 |
| age | 2.327 | 2.460 | 0.133 | 0.189 | 0.164 | 67.320 | 1.154 | 3.765 | 2.610 | 1.000 |
| hgt | 0.191 | 0.170 | -0.021 | 0.028 | 0.024 | 44.089 | -0.020 | 0.359 | 0.379 | 0.998 |
It can be seen that the coverage of the confidence interval is nearly equal to one. Since we currently only want to make inferences with regard to the sample, this is okay. Namely, if the CI coverage with regard to the sample would be \(95\%\), the CI coverage with regard to the population would be \(0.9025\%\). Inferences with regard to the population will be discussed in the next paragraph.
However, what is questionable, is the fact that the confidence interval width has increased, even though the variance has decreased. Since the confidence interval width is a function of only the standard error and the degrees of freedom, this must be due to the fact that the degrees of freedom are smaller than the degrees of freedom as returned by the mice pooling function.
mice_df <- data.frame(term = cart_fit$var, df = cart_fit$df)
syn_df <- data.frame(term = pooled$term, df = pooled$df)
df <- bind_rows("Mice" = mice_df, "Synthetic" = syn_df, .id = "Method")
ggplot(data = df, aes(x = df, color = Method, fill = Method)) +
geom_density(alpha = .7) +
facet_wrap(~ term, nrow = 1, scales = "free") +
scale_color_viridis_d(begin = .30, end = .65) +
scale_fill_viridis_d(begin = .30, end = .65) +
theme_classic()Now we have established that the mice synthesizing algorithm works as good as the synthpop synthesizing algorithm, we can check whether changing the order in which hgt and wgt are synthesized influences the result of the algorithm.
cart_wgt_hgt %>%
map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
map_dfr(pool.syn) %>%
ci_cov(truemodel) %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| term | True Est | Syn Est | Bias | True SE | Syn SE | df | Lower | Upper | CIW | Coverage |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | -9.216 | -7.605 | 1.612 | 2.033 | 1.776 | 35.370 | -20.822 | 5.612 | 26.434 | 1.000 |
| age | 2.327 | 2.463 | 0.137 | 0.189 | 0.165 | 48.288 | 1.109 | 3.818 | 2.709 | 0.998 |
| hgt | 0.191 | 0.169 | -0.022 | 0.028 | 0.024 | 32.164 | -0.023 | 0.362 | 0.385 | 0.998 |
true_results <- bootstrap_boys %>%
map(~ lm(wgt ~ age + hgt, .x)) %>%
map_dfr(~ broom::tidy(.x, conf.int = TRUE)) %>%
mutate(true_coef = rep(coef(truemodel), nsim),
true_se = rep(sqrt(diag(vcov(truemodel))), nsim),
cover = conf.low < true_coef & true_coef < conf.high) %T>%
assign("boot_true", ., envir = globalenv()) %>%
group_by(term) %>%
summarise("True Est" = unique(true_coef),
"Boot Est" = mean(estimate),
"Bias" = mean(estimate - true_coef),
"True SE" = unique(true_se),
"Boot SE" = mean(std.error),
"DF" = 745,
"Lower" = mean(conf.low),
"Upper" = mean(conf.high),
"CIW" = mean(conf.high - conf.low),
"Coverage" = mean(cover)) %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
synthetic_results <- boot_cart %>%
map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
map_dfr(pool.syn) %T>%
assign("boot_syn", ., envir = globalenv()) %>%
ci_cov(truemodel) %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
norm_results <- boot_cart %>%
map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
map_dfr(pool.syn) %>%
mutate(df = NA,
lower = est - qnorm(.975) * sqrt(var),
upper = est + qnorm(.975) * sqrt(var)) %>%
ci_cov(truemodel) %>%
kable(digits = 3, caption = "Normal CI approximation") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| term | True Est | Boot Est | Bias | True SE | Boot SE | DF | Lower | Upper | CIW | Coverage |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | -9.216 | -9.301 | -0.084 | 2.033 | 2.025 | 745 | -13.275 | -5.326 | 7.949 | 0.934 |
| age | 2.327 | 2.314 | -0.013 | 0.189 | 0.188 | 745 | 1.945 | 2.683 | 0.738 | 0.932 |
| hgt | 0.191 | 0.193 | 0.001 | 0.028 | 0.028 | 745 | 0.138 | 0.247 | 0.109 | 0.922 |
| term | True Est | Syn Est | Bias | True SE | Syn SE | df | Lower | Upper | CIW | Coverage |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | -9.216 | -8.101 | 1.116 | 2.033 | 1.809 | 141.821 | -19.147 | 2.946 | 22.092 | 0.962 |
| age | 2.327 | 2.419 | 0.092 | 0.189 | 0.164 | 74.606 | 1.315 | 3.523 | 2.208 | 0.966 |
| hgt | 0.191 | 0.176 | -0.015 | 0.028 | 0.024 | 86.320 | 0.021 | 0.331 | 0.310 | 0.968 |
| term | True Est | Syn Est | Bias | True SE | Syn SE | df | Lower | Upper | CIW | Coverage |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | -9.216 | -8.101 | 1.116 | 2.033 | 1.809 | NA | -11.646 | -4.555 | 7.090 | 0.842 |
| age | 2.327 | 2.419 | 0.092 | 0.189 | 0.164 | NA | 2.098 | 2.740 | 0.643 | 0.824 |
| hgt | 0.191 | 0.176 | -0.015 | 0.028 | 0.024 | NA | 0.129 | 0.223 | 0.094 | 0.810 |
We find that some bias has been introduced due to synthesizing the relationships, so we investigate this in further depth. Additionally, the coverage is somewhat too high, and the average confidence interval width has increased dramatically. Therefore, we will further investigate where this increase stems from.
plot_dat <- bind_rows("Synthetic" = boot_syn %>%
mutate(ciw = upper - lower) %>%
select(term, est, ciw = ciw),
"Real values" = boot_true %>%
mutate(ciw = conf.high - conf.low) %>%
select(term, est = estimate, ciw = ciw),
.id = "Method")
ggplot(plot_dat, aes(x = est, color = Method, fill = Method)) +
geom_density(alpha = .7) +
facet_wrap(~ term, nrow = 1, scales = "free") +
scale_color_viridis_d(begin = .30, end = .65) +
scale_fill_viridis_d(begin = .30, end = .65) +
theme_classic()It can be seen that the bias is systematic, and thus not due to any extreme values that pull the average estimate to zero. Additionally, it seems that the pooled standard error indicated in the table is somewhat too low for the synthetic data. Namely, it appears in the plot that the spread in the synthetic data is roughly equal to the spread in the bootstrapped, but real, data. However, in the table above, the synthetic standard errors are much smaller than the mean over the manually calculated standard error of the bootstrapped estimates. When the standard errors are calculated over the obtained regression estimates, we find that the synthetic standard error increases much more when we calculate it over the observed synthetic estimates.
boot_se <- boot_true %>% group_by(term) %>% summarise("Bootstrapped SE" = sd(estimate))
boot_syn_se <- boot_syn %>% group_by(term) %>% summarise("Synthetic bootstrapped SE" = sd(est))
bind_cols(boot_se, boot_syn_se[,2]) %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| term | Bootstrapped SE | Synthetic bootstrapped SE |
|---|---|---|
| (Intercept) | 2.105 | 2.046 |
| age | 0.207 | 0.202 |
| hgt | 0.030 | 0.030 |
ggplot(plot_dat, aes(x = ciw, color = Method, fill = Method)) +
geom_density(alpha = .7) +
facet_wrap(~ term, nrow = 1, scales = "free") +
scale_color_viridis_d(begin = .30, end = .65) +
scale_fill_viridis_d(begin = .30, end = .65) +
theme_classic()When we look at the confidence interval width, we find that the bootstrap CI is of approximately the same length in all iterations, while the CI’s of the synthetic results spread out much more.
The fact that there is bias introduced, might be due to the fact that there is too little variance in the estimates of the synthetic values. Therefore, an approach that increases the variance of the estimates might yield estimates that display less bias. This can be achieved by decreasing the number of iterations, that is, setting the mice parameter maxit to 1, instead of 5. Additionally, the complexity parameter can be set to \(10^{-8}\) instead of \(10^{-4}\), so that more splits will be made. Also, the value of the rpart parameter minbucket, that is, the minimum number of observations in any terminal node, is set to 3 instead of 5, so that, once again, more splits can be made. By means of these changes, the within variance increases, while the between-imputations variability decreases. The results with 500 iterations on the same, singly imputed boys dataset can be seen below.
syns_cart_maxit_cp_min3 %>%
map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
map_dfr(pool.syn) %>%
ci_cov(., truemodel) %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| term | True Est | Syn Est | Bias | True SE | Syn SE | df | Lower | Upper | CIW | Coverage |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | -9.216 | -8.357 | 0.859 | 2.033 | 1.825 | 63.626 | -19.388 | 2.674 | 22.062 | 1 |
| age | 2.327 | 2.406 | 0.080 | 0.189 | 0.167 | 84.422 | 1.224 | 3.589 | 2.366 | 1 |
| hgt | 0.191 | 0.179 | -0.012 | 0.028 | 0.025 | 64.223 | 0.012 | 0.346 | 0.334 | 1 |
It can be seen that the bias indeed decreases, while the standard error of the estimates increases. This is due to the fact that the within-imputation variance increases, while the between-imputation variance decreases somewhat. When we use the same approach to estimate the bootstrapped boys data, we get the following results.
boot_cart_maxit_cp_min3 %>%
map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
map_dfr(pool.syn) %>%
ci_cov(., truemodel) %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| term | True Est | Syn Est | Bias | True SE | Syn SE | df | Lower | Upper | CIW | Coverage |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | -9.216 | -9.051 | 0.165 | 2.033 | 1.968 | 330.899 | -16.740 | -1.362 | 15.379 | 0.962 |
| age | 2.327 | 2.336 | 0.009 | 0.189 | 0.178 | 213.036 | 1.490 | 3.181 | 1.691 | 0.960 |
| hgt | 0.191 | 0.189 | -0.002 | 0.028 | 0.027 | 236.766 | 0.066 | 0.312 | 0.246 | 0.962 |
cart_default <- boot_cart %>%
map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
map_dfr(pool.syn) %>%
group_by(term) %>%
summarise(Within = mean(ubar), Between = mean(bm), Total = mean(var))## `summarise()` ungrouping output (override with `.groups` argument)
cart_adj <- boot_cart_maxit_cp_min3 %>%
map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
map_dfr(pool.syn) %>%
group_by(term) %>%
summarise(Within = mean(ubar), Between = mean(bm), Total = mean(var))## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
## term Within Between Total
## <chr> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.89 2.14 3.45
## 2 age 0.0332 0.0222 0.0288
## 3 hgt 0.000727 0.000458 0.000623
## # A tibble: 3 x 4
## term Within Between Total
## <chr> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.05 1.39 3.94
## 2 age 0.0349 0.0142 0.0326
## 3 hgt 0.000764 0.000298 0.000731
Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in r. Journal of Statistical Software, 45(3), 1–67. Retrieved from https://www.jstatsoft.org/v45/i03/